Methods and apparatus for distributed training of a neural network

ABSTRACT

Methods, apparatus, systems and articles of manufacture for distributed training of a neural network are disclosed. An example apparatus includes a neural network trainer to select a plurality of training data items from a training data set based on a toggle rate of each item in the training data set. A neural network parameter memory is to store neural network training parameters. A neural network processor is to generate training data results from distributed training over multiple nodes of the neural network using the selected training data items and the neural network training parameters. The neural network trainer is to synchronize the training data results and to update the neural network training parameters.

FIELD OF THE DISCLOSURE

This disclosure relates generally to artificial intelligence computing, and, more particularly, to methods and apparatus for distributed training of a neural network.

BACKGROUND

Neural networks are useful tools that have demonstrated their value solving very complex problems regarding pattern recognition, natural language processing, automatic speech recognition, etc. Neural networks operate using artificial neurons arranged into layers that process data from an input layer to an output layer, applying weighting values to the data along the way. Such weighting values are determined during a training process.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a representation of two images having different toggle rates.

FIG. 2A is a diagram representing a distribution of power available at a compute node.

FIG. 2B is a diagram representing an alternate distribution of power available at a compute node when an increased toggle rate causes increased memory power consumption.

FIG. 3 is a diagram representing a chain of causality that leads to increased compute time at a compute node.

FIG. 4 is a block diagram of an example system constructed in accordance with the teachings of this disclosure to enable distributed training of a neural network.

FIG. 5 is a block diagram of an example node of the example system of FIG. 4.

FIG. 6 is a diagram illustrating various amounts of compute time based on toggle rates and available central processing unit (CPU) power.

FIG. 7 is a flowchart representative of example machine readable instructions which may be executed to implement the example training controller of FIG. 4 to control distributed training of a neural network.

FIG. 8 is a flowchart representative of example machine readable instructions which may be executed to implement the example training controller of FIG. 4 to generate groups of training data.

FIG. 9 is a flowchart representative of example machine readable instructions which may be executed to implement the example node of FIGS. 4 and/or 5 to perform a training iteration.

FIG. 10 is a diagram illustrating various amounts of compute time based on the use of balanced toggle rates described in connection with FIGS. 7, 8, and/or 9.

FIG. 11 is a flowchart representative of example machine readable instructions which may be executed to implement the example node of FIGS. 4 and/or 5 to select a power state for use in a training iteration.

FIG. 12 is a diagram illustrating various amounts of compute time based on the use of balanced toggle rates and selected power states described in connection with FIGS. 7, 8, 9, and/or 11.

FIG. 13 is a block diagram of an example processing platform structured to execute the instructions of FIGS. 7 and/or 8 to implement the example training controller of FIG. 4.

FIG. 14 is a block diagram of an example processing platform structured to execute the instructions of FIGS. 7, 8, 9, and/or 11 to implement the example node of FIGS. 4 and/or 5.

The figures are not to scale. Wherever possible, the same reference numbers will be used throughout the drawing(s) and accompanying written description to refer to the same or like parts.

DETAILED DESCRIPTION

Neural networks can be utilized for many different tasks including, for example, image recognition tasks, text classification tasks, etc. In some examples, image data is fed to a series of convolution and pooling (e.g., down-sampling) layers, which have the combined effect of extracting features from the image while at the same time reducing the spatial resolution. The output of the final convolution/pooling layer is then fed to a series of layers, which in the end produce a probability distribution across a set of classification labels. Such probability distributions can then be used to classify other images.

Neural networks as described above produce useful output if the neurons have been trained (e.g., assigned a suitable set of weights and biases that connect each layer). The process of arriving at accurate weights and/or biases (e.g., training parameters) is a computationally expensive process. In many cases, the parameter space is enormous. For example, the training process may involve finding a global minimum (or a close approximation thereof) of a function with millions of parameters. In some examples, a stochastic gradient descent (SGD) approach is used to determine the training parameters of a neural network. During training, the SGD approach uses small batches (mini-batches) of pre-labeled training data (typically somewhere between 32 and 1024 items) provided to the neural network, quantifying how accurately the neural network is able to classify the input via a differentiable loss or error function (E). This process is called forward propagation. The gradient of the loss function is then calculated with respect to the current training parameters. Using the gradients, training parameters are updated.

Each neural network layer is a differentiable function of the layer that precedes it. Thus, the gradients are computed layer-by-layer, moving from output to input, in a process called backward propagation. Finally, the weights in the network are adjusted according to the computed gradients, and the process is repeated with a fresh batch of training data until the network has reached a satisfactory degree of accuracy relative to the ground truth.
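By way of illustration only, the following Python sketch (using NumPy) performs one forward pass, one backward pass, and one weight update for a single mini-batch. The two-layer network shape, the squared-error loss, and the learning rate are hypothetical choices for the sketch, not elements of the disclosure.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sgd_iteration(x, y, w1, b1, w2, b2, lr=0.01):
    """One forward/backward pass over a mini-batch, followed by a weight update.

    x: (batch, n_in) mini-batch of training inputs
    y: (batch, n_out) pre-labeled targets
    """
    # Forward propagation: each layer is a differentiable function of the previous one.
    h = relu(x @ w1 + b1)                              # hidden layer activations
    y_hat = h @ w2 + b2                                # network output
    error = y_hat - y
    loss = 0.5 * np.mean(np.sum(error ** 2, axis=1))   # differentiable loss E

    # Backward propagation: gradients are computed layer-by-layer, output to input.
    batch = x.shape[0]
    grad_y_hat = error / batch
    grad_w2 = h.T @ grad_y_hat
    grad_b2 = grad_y_hat.sum(axis=0)
    grad_h = grad_y_hat @ w2.T
    grad_h[h <= 0.0] = 0.0                             # ReLU gradient mask
    grad_w1 = x.T @ grad_h
    grad_b1 = grad_h.sum(axis=0)

    # Weight update: w' = w - lr * dE/dw, applied in place to each parameter.
    w1 -= lr * grad_w1
    b1 -= lr * grad_b1
    w2 -= lr * grad_w2
    b2 -= lr * grad_b2
    return loss
```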

Training of a neural network is an expensive computational process. Such training often requires many iterations until a global minimum error is reached. In some examples, millions of iterations of the SGD process might be needed to arrive at the global minimum error. Processed serially, such iterations may take days, or even weeks, to complete. To address this, compute clusters are utilized to distribute the processing to multiple nodes to reduce the overall amount of processing time.

A natural extension of SGD, called synchronous SGD, allows training to be divided across multiple nodes. An iteration of synchronous SGD across N nodes is implemented by executing N forward/backward propagation passes in parallel, one on each node, with each node processing a mini-batch of training data. In examples disclosed herein, the mini-batches of training data are different among each respective node. In parallel with backward propagation (as backward propagation proceeds layer-by-layer), the nodes synchronize the gradients computed against their local mini-batches. In some examples, the gradients are synchronized in an “all-reduce” fashion. Each node takes the sum of these gradients, and applies it to its locally held copy of the weights and biases as usual; in this way, each node is assured to be working with the same weights and biases throughout the process. The synchronous SGD algorithm then continues with a new set of N mini-batches, repeating until convergence is reached.
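A conceptual sketch of one synchronous SGD iteration follows, simulated serially for clarity. The helper names (`allreduce_sum`, `grad_fn`) and the choice to average rather than merely sum the per-node gradients are assumptions made for the sketch, not requirements of the disclosure.

```python
def allreduce_sum(per_node_grads):
    """Element-wise sum across nodes (a software stand-in for an all-reduce collective)."""
    return {name: sum(g[name] for g in per_node_grads) for name in per_node_grads[0]}

def synchronous_sgd_iteration(per_node_params, per_node_batches, grad_fn, lr):
    """One iteration of synchronous SGD, simulated serially for illustration.

    per_node_params: list of parameter dicts, one per node, identical at the start
    per_node_batches: list of (x, y) mini-batches, one per node
    grad_fn: callable returning a dict of gradients for given params and batch
    """
    # Each node computes gradients against its own mini-batch (in a real cluster
    # these N forward/backward passes run in parallel, one per node).
    per_node_grads = [grad_fn(p, b) for p, b in zip(per_node_params, per_node_batches)]

    # All-reduce: every node obtains the same summed gradients.
    summed = allreduce_sum(per_node_grads)

    # Every node applies the same update to its local copy of the parameters,
    # so all copies remain identical throughout the process.
    n = len(per_node_params)
    for params in per_node_params:
        for name in params:
            params[name] -= lr * summed[name] / n   # averaging the per-node gradients
    return per_node_params
```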

An example disadvantage of the synchronous SGD algorithm is a susceptibility to a “straggler” problem. Due to the synchronous nature of the synchronous SGD approach, the system as a whole cannot work faster than its slowest node. In some examples, one SGD iteration may take between two hundred and five hundred milliseconds. Thus, if a first node were to complete the training iteration in two hundred milliseconds, and a second node were to complete the training iteration in five hundred milliseconds, the first node waits three hundred milliseconds for the second node to complete its training iteration, before synchronizing and proceeding to the next training iteration. This presents a significant challenge for scaling deep learning training across multiple nodes.

Not only does the imbalance of compute times increase the overall amount of time to train the neural network, the imbalance is a source of energy inefficiency. For example, the fastest nodes find themselves in a “hurry-up-and-wait” or “race-to-the-ground” power consumption situation.

A contributing factor to the differences encountered in processing times among the nodes is an imbalance in toggle rates in the training data. As used herein, a toggle rate is a metric representing an amount of variation in a training data item that results in variations in power consumption. In some examples, transistor switching activity increases with increased toggle rate, contributing to higher power consumption. FIG. 1 is a representation of two images having different toggle rates. A first image 110 is an image of a night scene along a road. A second image 120 is an image of a crowded city street. The first example image 110 has a low toggle rate, as the first image exhibits little variation from pixel to pixel. As a result, there is low transistor switching activity in connection with the first example image 110 and, consequently, lower power consumption. In contrast, the second image 120 has a high toggle rate, as a result of the many fine details and different pixel values from pixel to pixel. A higher toggle rate results in an increased amount of power being required from the node on which that data is processed.
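The disclosure does not mandate a particular way of computing a toggle rate. One plausible proxy, shown purely as an assumption, is the fraction of bits that differ between adjacent bytes of a training item (e.g., the raw pixel buffer of a training image), scaled to the zero to one half range used herein:

```python
import numpy as np

def toggle_rate(item_bytes: bytes) -> float:
    """Hypothetical toggle rate proxy in [0, 0.5].

    Constant data maps to 0; data whose consecutive bytes are bit-wise
    complements maps to 0.5. This is one illustrative choice, not the
    computation required by the disclosure.
    """
    data = np.frombuffer(item_bytes, dtype=np.uint8)
    if data.size < 2:
        return 0.0
    transitions = np.bitwise_xor(data[:-1], data[1:])   # bits that change between neighbors
    bits_flipped = np.unpackbits(transitions).mean()    # fraction of differing bits, in [0, 1]
    return float(bits_flipped / 2.0)                    # scale to the [0, 0.5] range used herein
```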

FIG. 2A is a diagram 200 representing a distribution of power available at a compute node. The example diagram 200 includes a first section 210 representative of an amount of central processing unit power consumption, a second section 220 representing an amount of memory power consumption, a third section 230 representing an amount of other power consumption, and a fourth section 240 representing an amount of unused power. In the aggregate, the first section 210, the second section 220, the third section 230, and the fourth section 240 represent a total amount of power available at a compute node. While power allocated to a memory system, a central processing unit, and other components of the node may fluctuate over time, the total power utilized by the node cannot exceed the total amount of power available to the compute node. For example, the first section 210 representing CPU power consumption may be adjusted at the node such that the power consumed by the node does not exceed the total power available to the node. That is, power consumed by the CPU may be throttled (e.g., reduced) to accommodate situations where the amount of power required by the memory (e.g., section 220) and/or other components (e.g., section 230) would exceed the total amount of available power.

FIG. 2B is a diagram 201 representing an alternate distribution of power available at a compute node when an increased toggle rate causes increased memory power consumption. As described below in connection with the chain of causality of FIG. 3, in some examples, memory power consumption (section 220 of FIG. 2A) may increase as a result of high data toggle rates (section 221 of FIG. 2B). In some examples, this may utilize all remaining power available to the node and, in some examples, may attempt to draw more power than is available to the node. To account for such overages, the CPU may be throttled. The example diagram 201 of FIG. 2B includes a first section 211 representing the throttled CPU power consumption, a second section 221 representing the increased memory power consumption, and a third section 231 representing an amount of other power consumption.

FIG. 3 is a diagram representing a chain of causality that leads to increased compute time at a compute node. In some examples, a toggle rate of training data may increase (Block 310). This increase in the toggle rate increases the amount of memory power consumption (Block 320). That is, the second section 220 of FIG. 2A becomes enlarged (see the second section 221 of FIG. 2B). In some examples, the increased amount of memory power consumption may cause the available CPU power to decrease (Block 330) (see the first section 211 of FIG. 2B). When the amount of CPU power is decreased, a processing frequency achieved by the CPU decreases (Block 340). When the processing frequency is reduced, the amount of compute time to perform a training task increases (Block 350).

Such a chain of causality may result in a few practical effects. For example, nodes may take different amounts of compute time to process training data, and depending on the data toggle rate, some nodes may complete their iteration of the synchronous SGD earlier while others “straggle”. Also, nodes that complete earlier (since they are processing training data with low data toggle rates) tend to run at a higher CPU frequency and consume more energy/power than nodes that process training data with higher toggle rates, which are forced to move to lower power states and finish later.

Example approaches disclosed herein seek to normalize the amount of time taken for each training iteration by better distributing the training data among the nodes. Moreover, example approaches disclosed herein improve energy utilization for nodes that would have otherwise completed their training iteration early.

FIG. 4 is a block diagram of an example system 400 constructed in accordance with the teachings of this disclosure to enable distributed training of a neural network. The example system 400 of the illustrated example of FIG. 4 includes a training controller 410 and a compute cluster 420. The example training controller 410 controls training operations of nodes in the cluster 420. In the illustrated example of FIG. 4, the example training controller 410 includes a training data store 430, a toggle rate identifier 435, a training data sorter 440, a training data grouper 445, a node controller 450, a centralized training parameter data store 460, and a node interface 470.

The example training data store 430 of the illustrated example of FIG. 4 is implemented by any memory, storage device and/or storage disc for storing data such as, for example, flash memory, magnetic media, optical media, etc. Furthermore, the data stored in the example training data store 430 may be in any data format such as, for example, binary data, comma delimited data, tab delimited data, structured query language (SQL) structures, etc. While in the illustrated example the training data store 430 is illustrated as a single element, the example training data store 430 and/or any other data storage elements described herein may be implemented by any number and/or type(s) of memories. In the illustrated example of FIG. 4, the example training data store 430 stores labeled training data that can be used by the example nodes when training a neural network. In some examples, the example training data store 430 stores a resultant model of the neural network. In examples disclosed herein, the training data includes data (e.g., images, documents, text, etc.) and tags and/or other classification information associated with the data.

The example toggle rate identifier 435 of the illustrated example of FIG. 4 is implemented by a logic circuit such as, for example, a hardware processor. However, any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), digital signal processor(s) (DSP(s)), etc. The example toggle rate identifier 435 determines a toggle rate for each item in the training data. In examples disclosed herein, the toggle rate corresponds to an amount of data variance in the data. In examples disclosed herein, the toggle rate is represented by a number between zero and one half (0.5). However, any other approach to representing an amount of data variance in a data item may additionally or alternatively be used.

The example training data sorter 440 of the illustrated example of FIG. 4 is implemented by a logic circuit such as, for example, a hardware processor. However, any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), ASIC(s), PLD(s), FPLD(s), DSP(s), etc. The example training data sorter 440 of the illustrated example of FIG. 4 sorts the items in the training data store 430 by their corresponding toggle rate. In examples disclosed herein, the items are sorted in ascending order based on their corresponding toggle rates. However, any other sorting approach may additionally or alternatively be used.

The example training data grouper 445 of the illustrated example of FIG. 4 is implemented by a logic circuit such as, for example, a hardware processor. However, any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), ASIC(s), PLD(s), FPLD(s), DSP(s), etc. The example training data grouper 445 of the illustrated example of FIG. 4 groups the sorted training data stored in the training data store 430 into groups. The example training data grouper 445 determines a number of items to be included in each group, and determines a number of groups of training items to be created. In examples disclosed herein, the number of groups is calculated as the total number of items in the training data divided by the number of items to be included in each group. In some examples, the number of items to be included in each group may be determined based on a selected number of groups.

The example training data grouper 445 initializes a first index and a second index. The example training data grouper 445 selects an item for allocation to a group based on the first index and the second index, and allocates the selected item to a group identified by the first index. The example training data grouper 445 increments the second index, and proceeds to allocate items until the second index reaches the number of items to be included in each group.

When the example training data grouper 445 determines that the second index has reached the number of items to be included in each group, the training data grouper 445 shuffles the items allocated to the group identified by the first index. Shuffling the items allocated to the group identified by the first index ensures that the items allocated to that group are in a random order. The example training data grouper 445 then increments the first index, and repeats the process until the first index reaches the number of groups.
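A minimal Python sketch of this sorting and grouping procedure follows. The list-based data structures and the `items_per_group` parameter are illustrative assumptions; the behavior matches the sort-slice-shuffle steps described above.

```python
import random

def group_training_data(items, toggle_rates, items_per_group):
    """Sort items by toggle rate, slice them into groups, and shuffle within each group.

    items: list of training data items
    toggle_rates: toggle rate (0 to 0.5) previously determined for each item
    items_per_group: number of items to be included in each group
    """
    # Sort the training items in ascending order of their corresponding toggle rates.
    ordered = [item for _, item in
               sorted(zip(toggle_rates, items), key=lambda pair: pair[0])]

    # The number of groups is the total number of items divided by the items per group.
    num_groups = len(ordered) // items_per_group

    groups = []
    for g in range(num_groups):
        group = ordered[g * items_per_group:(g + 1) * items_per_group]
        random.shuffle(group)        # randomize the order of items within the group
        groups.append(group)
    return groups
```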

The example node controller 450 of the illustrated example of FIG. 4 is implemented by a logic circuit such as, for example, a hardware processor. However, any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), ASIC(s), PLD(s), FPLD(s), DSP(s), etc. The example node controller 450 of the illustrated example of FIG. 4 controls operations of the nodes in the cluster 420. Once the training data has been sorted and grouped based on toggle rates, the example node controller 450 initializes training parameters to be used among each of the nodes in the cluster 420. In examples disclosed herein, each of the nodes in the cluster 420 is initialized with the same training parameters. However, in some examples, different initial training parameters may be utilized at each of the nodes. The example node controller 450 stores the initialized training parameters in the centralized training parameter data store 460.

The example node controller 450 instructs each of the nodes in the cluster 420 to perform a training iteration. To perform a training iteration, each of the nodes in the cluster 420 selects a same number of items from the grouped training data (referred to herein as a mini-batch) to be used during the training iteration, performs training based on the selected mini-batch, and returns results of the training. In examples disclosed herein, one item is selected from each group. Because each group corresponds to different levels of toggle rates, and because each node in the cluster 420 selects a same number of items from each group, there will be approximately a same average toggle rate used at each of the nodes and, as a result, a similar amount of computation time encountered at each of the nodes.
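A corresponding sketch of mini-batch selection at a node, assuming the groups produced by the previous sketch, might look like the following; the per-group sample count of one mirrors the one-item-per-group selection described herein.

```python
import random

def select_mini_batch(groups, items_per_group_per_batch=1, rng=None):
    """Build a node's mini-batch by drawing the same number of items from every group.

    Because each group spans a narrow band of toggle rates, drawing equally from
    all groups gives every node a mini-batch with approximately the same average
    toggle rate, and therefore a similar compute time.
    """
    rng = rng or random.Random()
    mini_batch = []
    for group in groups:
        mini_batch.extend(rng.sample(group, items_per_group_per_batch))
    return mini_batch
```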

Upon completion of the training iteration, each node in the cluster 420 reports the result of the training iteration to the node controller 450. In examples disclosed herein, the result includes gradients computed against the local mini-batch. The example node controller 450 receives the training results via the node interface 470. The example node controller 450 waits until training results have been received from each of the nodes that were performing the training iteration. Once training results have been received from each of the nodes, the example node controller 450 updates the training parameters stored in the centralized training parameter data store 460 based on the received results.

Using the reported results, the example node controller 450 then determines whether the training is complete. The example node controller 450 may determine the training is complete when, for example, the training results indicate that convergence has been achieved (e.g., gradient descent and/or error levels reported by each of the nodes are below an error threshold, a threshold number of training iterations have been performed, etc.). When the example node controller 450 determines that the training is not complete, the example node controller 450 synchronizes the updated training parameters to each of the nodes, and instructs each of the nodes to perform a subsequent training iteration. The example node controller 450 continues this process until the training is complete.

The example centralized training parameter data store 460 of the illustrated example of FIG. 4 is implemented by any memory, storage device and/or storage disc for storing data such as, for example, flash memory, magnetic media, optical media, etc. Furthermore, the data stored in the example centralized training parameter data store 460 may be in any data format such as, for example, binary data, comma delimited data, tab delimited data, structured query language (SQL) structures, etc. While in the illustrated example the centralized training parameter data store 460 is illustrated as a single element, the example centralized training parameter data store 460 and/or any other data storage elements described herein may be implemented by any number and/or type(s) of memories. In the illustrated example of FIG. 4, the centralized training parameter data store 460 stores training parameters (e.g., weights and/or biases) that are synchronized among the nodes during training.

The example node interface 470 of the illustrated example of FIG. 4 is implemented by a high speed fabric interface. In some examples, the example node interface 470 may be implemented by an Ethernet interface, a Remote Direct Memory Access (RDMA) interface, etc. However, any other type of interface that enables the training controller to communicate with the nodes in the cluster 420 may additionally or alternatively be used.

The example cluster 420 of the illustrated example of FIG. 4 is illustrated as separate from the training controller 410. However, in some examples, the training controller 410 may be implemented by one or more of the nodes in the cluster 420. The example one or more nodes in the cluster 420 perform training tasks based on training data provided by the example node controller 450 of the training controller 410. An example implementation of a node 421 is described below in connection with FIG. 5. In examples disclosed herein, the one or more nodes in the cluster 420 each have similar specifications (e.g., processing specifications, memory specifications, neural network accelerator specifications, etc.). Having similar specifications across the nodes in the cluster 420 facilitates better predictability of compute times when balanced toggle rates are used among those nodes.

FIG. 5 is a block diagram of an example node 421 of the example system of FIG. 4. The example node 421 of the illustrated example of FIG. 5 includes a training controller interface 510, a neural network trainer 520, a neural network processor 530, a neural network parameter memory 540, and a power state controller 550.

The example training controller interface 510 of the illustrated example of FIG. 5 is implemented by a high speed fabric interface. In some examples, the example training controller interface 510 may be implemented by an Ethernet interface, a Remote Direct Memory Access (RDMA) interface, etc. The example training controller interface 510 enables the example node 421 to communicate with the training controller 410 and/or other nodes in the cluster 420. However, any other type of interface that enables the node 421 to communicate with the training controller 410 and/or other nodes in the cluster 420 may additionally or alternatively be used.

The example neural network trainer 520 of the illustrated example of FIG. 5 is implemented by a logic circuit such as, for example, a hardware processor. However, any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), ASIC(s), PLD(s), FPLD(s), DSP(s), etc. The example neural network trainer 520 generates a mini-batch of items to be used in training at the node for a given iteration of the SGD process. In examples disclosed herein, a mini-batch will include thirty-two to one thousand and twenty-four items. However, any other number of items may be included in the selected mini-batch. In examples disclosed herein, the example neural network trainer 520 selects an item from each group of the grouped training data. Because items in each group have similar toggle rates, when a similar number of items are selected among all the groups, each node generates its own mini-batch that, in the aggregate, has a similar toggle rate to other nodes.

Using the generated mini-batch, the example neural network trainer 520 instructs the neural network processor 530 to perform training of the neural network. During training, the example neural network trainer 520 provides the mini-batch of pre-labeled training data to the neural network processor 530. The example neural network processor 530 processes each item in the mini-batch and provides an output. The example neural network trainer 520 calculates a training error resulting from the processing performed by the example neural network processor 530 on the selected mini-batch. The training error quantifies how accurately the parameters stored in the neural network parameter memory 540 are able to classify the input. The example neural network trainer 520 then transmits, via the training controller interface 510, the results of the training to the training controller 410. Transmitting the results of the training to the training controller enables the results of all of the nodes to be aggregated and utilized when updating training parameters for subsequent iterations of the training process.

The example neural network processor 530 of the illustrated example of FIG. 5 is implemented by a logic circuit such as, for example, a hardware processor. However, any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), ASIC(s), PLD(s), FPLD(s), DSP(s), etc. In examples disclosed herein, the example neural network processor 530 implements a neural network. The example neural network of the illustrated example of FIG. 5 is a deep neural network (DNN). However, any other past, present, and/or future neural network topology(ies) and/or architecture(s) may additionally or alternatively be used such as, for example, a convolutional neural network (CNN), a feed-forward neural network, etc. In examples disclosed herein, the deep neural network (DNN) utilizes multiple layers of artificial “neurons”, each of which maps n real-valued inputs x_j to a real-valued output v according to Equation 1, below:

$v = \phi\left( \sum_{j = 1}^{n} w_{j}x_{j} + b \right) \qquad \text{Equation 1}$

In Equation 1, w_j and b are weights and biases, respectively, associated with a given neuron, and φ is a nonlinear activation function, typically implemented by a rectified linear unit (ReLU) that clips negative values to zero but leaves positive values unchanged. In some examples, deep neural networks may utilize millions or even billions of such neurons, arranged in a layered fashion. For example, one layer of neurons is fed input data, its output is fed into another layer of neurons, and so on, until the output from the final layer of neurons is taken as the output of the network as a whole. In some examples, the shape of the connections between layers may vary. For example, a fully connected topology connects every output of layer L to each neuron of the next layer (e.g., L+1). In contrast, a convolutional layer includes only a small number of neurons that are swept across patches of the input data. Typically, the use of convolutional layers is useful when the input data has some sort of natural spatial interpretation such as, for example, the pixels in an image, the samples in a waveform, etc.
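For reference, Equation 1 for a single neuron with a ReLU activation can be expressed as follows; the input, weight, and bias values are arbitrary examples chosen for illustration.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: clips negative values to zero, leaves positive values unchanged."""
    return np.maximum(x, 0.0)

def neuron_output(x, w, b):
    """Equation 1: v = phi(sum_j w_j * x_j + b), with phi chosen here as a ReLU."""
    return relu(np.dot(w, x) + b)

# Example: a neuron with three inputs.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, -0.1])
b = 0.05
print(neuron_output(x, w, b))   # 0.1 - 0.4 - 0.2 + 0.05 = -0.45, clipped by ReLU to 0.0
```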

The example neural network parameter memory 540 of the illustrated example of FIG. 5 is implemented by any memory, storage device and/or storage disc for storing data such as, for example, flash memory, magnetic media, optical media, etc. Furthermore, the data stored in the example neural network parameter memory 540 may be in any data format such as, for example, binary data, comma delimited data, tab delimited data, structured query language (SQL) structures, etc. While in the illustrated example the neural network parameter memory 540 is illustrated as a single element, the neural network parameter memory 540 and/or any other data storage elements described herein may be implemented by any number and/or type(s) of memories. In the illustrated example of FIG. 5, the example neural network parameter memory 540 stores neural network weighting parameters that are used by the neural network processor 530 to process inputs for generation of one or more outputs.

The example power state controller 550 of the illustrated example of FIG. 5 is implemented by a logic circuit such as, for example, a hardware processor. However, any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), ASIC(s), PLD(s), FPLD(s), DSP(s), etc. In examples disclosed herein, the example power state controller 550 controls a power state of a node to, in some examples, cause throttling of processing performed by the neural network processor 530. When processing is throttled (by way of selecting a reduced power state), compute time is increased. When throttled in a controlled manner, compute time can be controlled such that the node on which the power state controller 550 operates completes processing at approximately the same time as each of the other nodes in the cluster 420. Utilizing such an approach enables reduction(s) in energy consumption by placing into low-power state(s) those nodes that would have been likely to complete processing sooner than the other nodes.

In examples disclosed herein, each node has a set of N discrete power states that can be set at runtime. For example, a node may have three discrete power states that include states P0 (turbo), P1 (medium), and Pn (low). In examples disclosed herein, the example power state controller 550 determines an average toggle rate of items in the selected mini-batch. In some examples, the toggle rate is identified by reading values associated with the items of the selected mini-batch. However, in some examples, the power state controller 550 may process the items to determine the toggle rate. In examples disclosed herein, the example power state controller 550 determines the average toggle rate by adding toggle rates corresponding to each of the items in the selected mini-batch, and dividing the sum of those toggle rates by the number of items in the mini-batch.
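Expressed as code, that average is simply the sum of the per-item toggle rates divided by the item count; the per-item rates are assumed to have been read or computed as described above.

```python
def average_toggle_rate(mini_batch_toggle_rates):
    """Average toggle rate of the selected mini-batch (each rate is between 0 and 0.5)."""
    return sum(mini_batch_toggle_rates) / len(mini_batch_toggle_rates)
```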

The example power state controller 550 determines a number of available power states. In examples disclosed herein, lower numbered power states represent higher wattage power states, and higher numbered power states represent lower wattage power states. For example, a power state 0 represents a higher wattage power state than a power state 1. However, any other approach to arranging power states may additionally or alternatively be used.

The example power state controller 550 selects a power state based on the average toggle rate. In examples disclosed herein, the example power state controller 550 selects the power state by determining a number of available power states, and utilizing Equation 2, below:

[N*2*(0.5−T)]  Equation 2

In Equation 2, N represents the number of available power states, and T represents the average toggle rate of the mini-batch. As noted above, toggle rates are represented by a number between zero and one half. The square brackets in Equation 2 represent a function for selecting the closest integer value. Thus, the power state selection is performed by selecting a power state corresponding to the nearest integer of the product of the number of available power states, two, and the difference between one half and the average toggle rate. However, any other function for selecting a power state may additionally or alternatively be used such as, for example, a ceiling function, a floor function, etc.

As a result, when the toggle rate is high (e.g., approaching the maximum toggle rate (0.5)), a low numbered power state will be selected (corresponding to a higher wattage power state). Conversely, when the toggle rate is low (e.g., approaching the minimum toggle rate (0)), a high numbered power state will be selected (corresponding to a lower wattage power state). The example power state controller 550 sets the power state of the node 421 using the selected power state. In examples disclosed herein, the power state is set by writing to a model specific register (MSR) that enables control of power states. Such a power state selection approach results in nodes processing mini-batches with low toggle rates being throttled to conserve power, in an attempt to have all of the nodes complete the training iteration at approximately a same time.
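The Equation 2 selection can be sketched as follows. The clamp to the valid index range and the example state labels are assumptions made for the sketch, and the MSR write that actually applies the selected state is not shown.

```python
def select_power_state(avg_toggle_rate, num_power_states):
    """Equation 2: choose the power state index nearest to N * 2 * (0.5 - T).

    Lower-numbered states are higher-wattage (e.g., state 0 = turbo), so a high
    average toggle rate maps to a low-numbered (high-power) state and a low
    toggle rate maps to a high-numbered (throttled) state.
    """
    state = round(num_power_states * 2.0 * (0.5 - avg_toggle_rate))
    # Clamp to the valid range of state indices (an added assumption; Equation 2
    # itself can evaluate to N when the toggle rate is exactly zero).
    return max(0, min(num_power_states - 1, state))

# Example with three power states P0 (turbo), P1 (medium), P2 (low):
print(select_power_state(0.45, 3))   # high toggle rate      -> 0 (turbo)
print(select_power_state(0.30, 3))   # moderate toggle rate  -> 1 (medium)
print(select_power_state(0.05, 3))   # low toggle rate       -> 2 (low, throttled)
```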

While in examples disclosed herein, the toggle rate corresponds to the toggle rate of the mini-batch identified at the particular node 421, in some examples, such an approach may be implemented to consider mini-batch toggle rates of other nodes. Thus, for example, if all nodes happen to generate low-toggle mini-batches, those nodes may operate at a high frequency, since no node is likely to “outrun” the others.

While an example manner of implementing the example training controller 410 is illustrated in FIG. 4 and an example manner of implementing the node 421 is illustrated in FIGS. 4 and/or 5, one or more of the elements, processes and/or devices illustrated in FIGS. 4 and/or 5 may be combined, divided, re-arranged, omitted, eliminated and/or implemented in any other way. Further, the example training data store 430, the example toggle rate identifier 435, the example training data sorter 440, the example training data grouper 445, the example node controller 450, the example centralized training parameter data store 460, the example node interface 470, and/or, more generally, the example training controller 410 of FIG. 4, and/or the example training controller interface 510, the example neural network trainer 520, the example neural network processor 530, the example neural network parameter memory 540, the example power state controller 550, and/or, more generally, the example node 421 of FIGS. 4 and/or 5 may be implemented by hardware, software, firmware and/or any combination of hardware, software and/or firmware. Thus, for example, any of the example training data store 430, the example toggle rate identifier 435, the example training data sorter 440, the example training data grouper 445, the example node controller 450, the example centralized training parameter data store 460, the example node interface 470, and/or, more generally, the example training controller 410 of FIG. 4, and/or the example training controller interface 510, the example neural network trainer 520, the example neural network processor 530, the example neural network parameter memory 540, the example power state controller 550, and/or, more generally, the example node 421 of FIGS. 4 and/or 5 could be implemented by one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)) and/or field programmable logic device(s) (FPLD(s)). When reading any of the apparatus or system claims of this patent to cover a purely software and/or firmware implementation, at least one of the example training data store 430, the example toggle rate identifier 435, the example training data sorter 440, the example training data grouper 445, the example node controller 450, the example centralized training parameter data store 460, the example node interface 470, and/or, more generally, the example training controller 410 of FIG. 4, and/or the example training controller interface 510, the example neural network trainer 520, the example neural network processor 530, the example neural network parameter memory 540, the example power state controller 550, and/or, more generally, the example node 421 of FIGS. 4 and/or 5 is/are hereby expressly defined to include a non-transitory computer readable storage device or storage disk such as a memory, a digital versatile disk (DVD), a compact disk (CD), a Blu-ray disk, etc. including the software and/or firmware. Further still, the example training controller 410 of FIG. 4 and/or the example node 421 of FIGS. 4 and/or 5 may include one or more elements, processes and/or devices in addition to, or instead of, those illustrated in FIGS. 4 and/or 5, and/or may include more than one of any or all of the illustrated elements, processes and devices.

Flowcharts representative of example machine readable instructions forimplementing the training controller 410 of FIG. 4 are shown in FIGS. 7and/or 8. In these examples, the machine readable instructions comprisea program(s) for execution by a processor such as the processor 1312shown in the example processor platform 1300 discussed below inconnection with FIG. 13. The program may be embodied in software storedon a non-transitory computer readable storage medium such as a CD-ROM, afloppy disk, a hard drive, a digital versatile disk (DVD), a Blu-raydisk, or a memory associated with the processor 1312, but the entireprogram and/or parts thereof could alternatively be executed by a deviceother than the processor 1312 and/or embodied in firmware or dedicatedhardware. Further, although the example program(s) is/are described withreference to the flowchart illustrated in FIGS. 7 and/or 8, many othermethods of implementing the example training controller 410 mayalternatively be used. For example, the order of execution of the blocksmay be changed, and/or some of the blocks described may be changed,eliminated, or combined. Additionally or alternatively, any or all ofthe blocks may be implemented by one or more hardware circuits (e.g.,discrete and/or integrated analog and/or digital circuitry, a FieldProgrammable Gate Array (FPGA), an Application Specific Integratedcircuit (ASIC), a comparator, an operational-amplifier (op-amp), a logiccircuit, etc.) structured to perform the corresponding operation withoutexecuting software or firmware.

Flowcharts representative of example machine readable instructions forimplementing the example node 421 of FIGS. 4 and/or 5 are shown in FIGS.9 and/or 11. In these examples, the machine readable instructionscomprise a program(s) for execution by a processor such as the processor1412 shown in the example processor platform 1400 discussed below inconnection with FIG. 14. The program may be embodied in software storedon a non-transitory computer readable storage medium such as a CD-ROM, afloppy disk, a hard drive, a digital versatile disk (DVD), a Blu-raydisk, or a memory associated with the processor 1412, but the entireprogram and/or parts thereof could alternatively be executed by a deviceother than the processor 1412 and/or embodied in firmware or dedicatedhardware. Further, although the example program(s) is/are described withreference to the flowchart illustrated in FIGS. 9 and/or 11, many othermethods of implementing the example node 421 may alternatively be used.For example, the order of execution of the blocks may be changed, and/orsome of the blocks described may be changed, eliminated, or combined.Additionally or alternatively, any or all of the blocks may beimplemented by one or more hardware circuits (e.g., discrete and/orintegrated analog and/or digital circuitry, a Field Programmable GateArray (FPGA), an Application Specific Integrated circuit (ASIC), acomparator, an operational-amplifier (op-amp), a logic circuit, etc.)structured to perform the corresponding operation without executingsoftware or firmware.

As mentioned above, the example processes of FIGS. 7, 8, 9, and/or 11 may be implemented using coded instructions (e.g., computer and/or machine readable instructions) stored on a non-transitory computer and/or machine readable medium such as a hard disk drive, a flash memory, a read-only memory, a compact disk, a digital versatile disk, a cache, a random-access memory and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information). As used herein, the term non-transitory computer readable medium is expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media. “Including” and “comprising” (and all forms and tenses thereof) are used herein to be open ended terms. Thus, whenever a claim lists anything following any form of “include” or “comprise” (e.g., comprises, includes, comprising, including, etc.), it is to be understood that additional elements, terms, etc. may be present without falling outside the scope of the corresponding claim. As used herein, when the phrase “at least” is used as the transition term in a preamble of a claim, it is open-ended in the same manner as the terms “comprising” and “including” are open ended.

FIG. 6 is a diagram 600 illustrating various amounts of compute time based on toggle rates and available central processing unit (CPU) power. The diagram 600 of the illustrated example of FIG. 6 shows processing performed at four nodes 614, 634, 654, 674. In the illustrated example of FIG. 6, each of the nodes 614, 634, 654, 674 receives training data having different toggle rates. As a result, different amounts of processing power are available at each of the nodes, further resulting in different amounts of compute time being utilized among the nodes.

The example diagram 600 of the illustrated example of FIG. 6 shows processing 610 performed at a first node 614 on training data having a lowest toggle rate 612. In the illustrated example of FIG. 6, the training data having the lowest toggle rate 612 results in the node 614 being able to allocate 105 watts of available CPU power for processing 616. As a result, the first node 614 processes the training data 612 in a lowest amount of time 618.

The example diagram 600 of the illustrated example of FIG. 6 shows processing 630 performed at a second node 634 on training data having a low toggle rate 632. In the illustrated example of FIG. 6, the training data having the low toggle rate 632 results in the second node 634 being able to allocate 101 watts of available CPU power for processing 636. As a result, the second node 634 processes the training data 632 in a low amount of time 638. In the illustrated example of FIG. 6, the low amount of time 638 is greater than the lowest amount of time 618.

The example diagram 600 of the illustrated example of FIG. 6 shows processing 650 performed at a third node 654 on training data having a high toggle rate 652. In the illustrated example of FIG. 6, the training data having the high toggle rate 652 results in the third node 654 being able to allocate 97 watts of available CPU power for processing 656. As a result, the third node 654 processes the training data 652 in a high amount of time 658. In the illustrated example of FIG. 6, the high amount of time 658 is greater than the lowest amount of time 618, and is greater than the low amount of time 638.

The example diagram 600 of the illustrated example of FIG. 6 shows processing 670 performed at a fourth node 674 on training data having a highest toggle rate 672. In the illustrated example of FIG. 6, the training data having the highest toggle rate 672 results in the fourth node 674 being able to allocate 94 watts of available CPU power for processing 676. As a result, the fourth node 674 processes the training data 672 in a highest amount of time 678. In the illustrated example of FIG. 6, the highest amount of time 678 is greater than the lowest amount of time 618, is greater than the low amount of time 638, and is greater than the high amount of time 658.

In the illustrated example of FIG. 6, line 690 represents an amount of time taken for the training iteration. In the illustrated example of FIG. 6, line 690 corresponds to the end of the training computations performed by the fourth node 674 (e.g., the node with the training data having the highest toggle rate). Thus, while the first node 614, the second node 634, and the third node 654 utilized more available CPU power to complete their computations more quickly than the fourth node 674, their additional effort was wasted because the first node 614, the second node 634, and the third node 654 had to wait for the processing of the fourth node 674 to complete the training iteration before proceeding to the next training iteration.

To summarize FIG. 6, the toggle rate is correlated with the amount of compute time required for processing of training data assigned to a given node. Without utilizing the teachings of this disclosure, when the toggle rate is increased, longer amounts of time may be required for processing the training data.

FIG. 7 is a flowchart representative of example machine readable instructions which may be executed to implement the example training controller 410 of FIG. 4 to control distributed training of a neural network. The example process 700 of the illustrated example of FIG. 7 begins when the example training controller 410 accesses training data. (Block 705). In examples disclosed herein, the training data is stored in the training data store 430. However, in some examples, the training data may be located in a remote location (e.g., a remote server) and may, in some examples, be retrieved by the training controller 410. In some examples, the training data is provided to the training controller 410 without the training controller 410 actively retrieving the training data.

The example toggle rate identifier 435, the example training data sorter 440, and the example training data grouper 445 process the training data stored in the example training data store 430 to generate groups of training data. (Block 710). An example approach to generating groups of training data is disclosed in further detail below in connection with FIG. 8. In short, the example toggle rate identifier 435 determines toggle rates for each item in the training data, the example training data sorter 440 sorts the items by their corresponding toggle rate, and the example training data grouper 445 arranges the training data into groups based on the identified and sorted toggle rates. Grouping based on sorted toggle rates enables a balanced set of training data to be provided to and/or selected by each node during training which, in turn, results in similar computation times across the nodes.

The example node controller 450 initializes training parameters used among each of the nodes in the cluster 420. (Block 720). In examples disclosed herein, each of the nodes in the cluster 420 is initialized with the same training parameters. The example node controller 450 stores the initialized training parameters in the centralized training parameter data store 460.

The example node controller 450 instructs each of the nodes in the cluster 420 to perform a training iteration. (Block 730). An example approach for performing a training iteration at a node is disclosed below in connection with FIG. 9. In short, each of the nodes in the cluster 420 selects a same number of items from the grouped training data to be used during the training iteration. In examples disclosed herein, one item is selected from each group. Because each group corresponds to different levels of toggle rates, and each node in the cluster selects a same number of items from each group, there will be approximately a same average toggle rate used at each of the nodes and, as a result, a similar amount of computation time used at each of the nodes. Using the selected mini-batch(es), each node in the cluster 420 performs neural network training, and reports a result of the training.

Upon completion of the training iteration, each node in the cluster 420 reports the result of the training iteration. In examples disclosed herein, the result includes gradients computed against the local mini-batch. The example node controller 450 receives the training results via the node interface 470. (Block 740). The example node controller 450 waits until training results have been received from each of the nodes that were performing the training iteration. (Block 742). If the example node controller 450 determines that there are additional results of training iteration(s) to be received (e.g., block 742 returns a result of NO), the example node controller 450 continues to collect the results of the training iterations. (Block 740). Once training results have been received from each of the nodes, the example node controller 450 updates the training parameters stored in the centralized training parameter data store 460 based on the received results.

In some examples, the nodes operate in a decentralized fashion and, instead of communicating their results to a centralized training data store, communicate with the other nodes to synchronize training data using, for example, an all-reduce algorithm. For example, each node takes the sum of the results of each of the nodes (e.g., the gradients), and applies the sum of the results to a locally held copy of the weights and biases as usual. In this manner, each node is assured to be working with the same weights and biases throughout the process.

The example node controller 450 then determines whether the training is complete. (Block 750). The example node controller 450 may determine the training is complete when, for example, the training results indicate that convergence has been achieved (e.g., gradient and/or error levels reported by each of the nodes are below an error threshold, a threshold number of training iterations have been performed, etc.). When the example node controller 450 determines that the training is not complete (e.g., block 750 returns a result of NO), the example node controller 450 synchronizes the updated training parameters to each of the nodes. (Block 760). The node controller 450 then instructs each of the nodes to perform a subsequent training iteration. (Block 730). The example process of blocks 730 through 760 is repeated until the example node controller 450 identifies that the training is complete (e.g., until block 750 returns a result of YES). Upon completion of training, the neural network parameters derived via the training may be re-used in a neural network to process and/or classify other data.
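A compact sketch of this control loop (blocks 730 through 760) follows. The `run_iteration` method on each node, the dictionary-based result format, and the learning rate are hypothetical stand-ins introduced only for the sketch.

```python
def train_distributed(nodes, init_params, lr, max_iterations, error_threshold):
    """Controller-side loop: instruct nodes, collect results, update parameters,
    synchronize, and repeat until the training is complete."""
    params = dict(init_params)                       # centralized training parameters
    for _ in range(max_iterations):
        # Instruct each node to perform a training iteration and collect its result;
        # passing `params` to every node serves as the synchronization step.
        results = [node.run_iteration(params) for node in nodes]

        # Update the centralized training parameters from the aggregated gradients.
        for name in params:
            summed = sum(result["gradients"][name] for result in results)
            params[name] -= lr * summed / len(nodes)

        # Training is complete when, for example, every node reports an error
        # below the error threshold; otherwise another iteration is performed.
        if all(result["error"] < error_threshold for result in results):
            break
    return params
```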

FIG. 8 is a flowchart representative of example machine readable instructions which may be executed to implement the example training controller of FIG. 4 to generate groups of training data. The example process 710 of the illustrated example of FIG. 8 represents an example approach to generating groups of training data described in connection with block 710 of FIG. 7. The example process 710 of the illustrated example of FIG. 8 begins when the example toggle rate identifier 435 determines a toggle rate for each item in the training data. (Block 805). In examples disclosed herein, the toggle rate is represented by a number between zero and one half (0.5). However, any other approach to representing an amount of data variance in a training item may additionally or alternatively be used.

The example training data sorter 440 sorts the items by their corresponding toggle rate. (Block 810). In examples disclosed herein, the items are sorted in ascending order based on their toggle rates. However, any other sorting approach may additionally or alternatively be used.

The example training data grouper 445 determines a number of items to be included in each group. (Block 815). The example training data grouper 445 determines a number of groups of training items to be created. (Block 820). In examples disclosed herein, the number of groups is calculated as the total number of items in the training data divided by the number of items to be included in each group. While in the illustrated example of FIG. 8 the number of groups is determined based on the number of items to be included in each group, in some examples the number of items to be included in each group may be determined based on a selected number of groups.

The example training data grouper 445 initializes a first index. (Block 825). The example training data grouper 445 initializes a second index. (Block 830). The example training data grouper 445 then selects an item for allocation to a group based on the first index and the second index. (Block 835). The example training data grouper 445 allocates the selected item to a group identified by the first index. The example training data grouper 445 increments the second index. (Block 845). The example training data grouper 445 determines whether the second index has reached the number of items to be included in each group. (Block 850). In examples disclosed herein, if the example training data grouper 445 determines that the second index has not reached the number of items to be included in each group (e.g., block 850 returns a result of NO), the example process of blocks 835 through 850 is repeated until the second index reaches the number of items to be included in each group (e.g., until block 850 returns a result of YES).

When the example training data grouper 445 determines that the second index has reached the number of items to be included in each group (e.g., block 850 returns a result of YES), the training data grouper 445 shuffles the items allocated to the group identified by the first index. (Block 855). Shuffling the items allocated to the group identified by the first index ensures that the items allocated to that group are in a random order. The example training data grouper 445 then increments the first index. (Block 860). The example training data grouper 445 then determines whether the first index has reached the number of groups. (Block 865). If the first index has not reached the number of groups (e.g., block 865 returns a result of NO), the example process of blocks 830 through 865 is repeated until the training items are allocated to each group. If the training data grouper 445 determines that the first index has reached the number of groups (e.g., block 865 returns a result of YES), the example process of FIG. 8 terminates. Control returns to block 720 of FIG. 7, where training parameters are initialized.

FIG. 9 is a flowchart representative of example machine readable instructions which may be executed to implement the example node 421 of FIGS. 4 and/or 5 to perform a training iteration. The example process 730 of the illustrated example of FIG. 9 begins when the example neural network trainer 520 generates a mini-batch of items to be used in training. (Block 910). In examples disclosed herein, a mini-batch will include thirty-two to one thousand and twenty-four items. However, any other number of items may be included in the selected mini-batch. In examples disclosed herein, the example neural network trainer 520 selects an item from each group of the grouped training data. Because items in each group have similar toggle rates, when a similar number of items are selected among all the groups, each node generates its own mini-batch that, in the aggregate, has a similar toggle rate to other nodes.

In some examples, the example power state controller 550 optionally adjusts a power state of the processor of the node 421. (Block 920). Adjusting the power state of the processor of the node 421 enables the node to account for minor variations in the average toggle rate of the selected mini-batch. If, for example, the average toggle rate of the mini-batch were low, the node might be throttled in an effort to complete its computations at a same time as the other nodes (e.g., to not finish too early as a result of using a high wattage power state). An example approach for adjusting a power state of the node 421 is described further in connection with FIG. 11.

The example neural network trainer 520 instructs the neural network processor 530 to perform training of the neural network. (Block 930). During training, the example neural network trainer 520 provides the mini-batch of pre-labeled training data (typically somewhere between 32 and 1024 images at a time) to the neural network processor 530. The example neural network processor 530 processes each item in the mini-batch and provides an output. The example neural network trainer 520 calculates a training error resulting from the processing performed by the example neural network processor 530 on the selected mini-batch. The training error quantifies how accurately the parameters stored in the neural network parameter memory 540 are able to classify the input via a differentiable loss or error function (E). This process is called forward propagation. The example neural network trainer 520 then computes a gradient $\frac{\partial}{\partial w}(E)$ of the loss function with respect to the current weights w. (Block 945). In some examples, using the gradients, the weights are updated according to Equation 3, below, where w′ are the updated weights, w are the weights prior to the adjustment, and θ is a tunable parameter called the learning rate.

$w^{\prime} = w - \theta\,\frac{\partial}{\partial w}(E)$  Equation 3

Since each neural network layer is a differentiable function of the layer that precedes it, the gradients may be computed layer-by-layer, moving from output to input, in a process called backward propagation.
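The weight update of Equation 3 can be expressed as a short Python sketch, assuming that the per-layer weights and gradients are available as parallel lists of (e.g., NumPy) arrays; the function name sgd_update and the default learning rate are illustrative.

    def sgd_update(weights, gradients, learning_rate=0.01):
        # Equation 3: w' = w - theta * d/dw(E), applied to each layer's weights.
        # weights and gradients are parallel lists of per-layer arrays.
        return [w - learning_rate * g for w, g in zip(weights, gradients)]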

The example neural network trainer 520 then transmits, via the training controller interface 510, the results of the training (e.g., the weights and/or the gradients) to the training controller 410. Transmitting the results of the training to the training controller enables the results of all of the nodes to be aggregated and utilized when updating training parameters for subsequent iterations of the training process. The example process 730 of the illustrated example of FIG. 9 then terminates. Control returns to block 740 of FIG. 7, where the example node controller collects results of the training iteration from the nodes in the cluster 420. The example process 730 of the illustrated example of FIG. 9 may then be repeated to perform a subsequent training iteration of the training process.
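The disclosure does not prescribe a particular aggregation rule for the collected results; a common choice, shown in the hedged sketch below, is to average the per-layer gradients reported by the nodes before updating the shared parameters. The function name aggregate_node_results is illustrative.

    def aggregate_node_results(per_node_gradients):
        # per_node_gradients: list over nodes, each a list of per-layer gradient arrays.
        num_nodes = len(per_node_gradients)
        # Element-wise average of each layer's gradient across all nodes.
        return [sum(layer_grads) / num_nodes for layer_grads in zip(*per_node_gradients)]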

FIG. 10 is a diagram illustrating various amounts of compute time based on the use of balanced toggle rates described in connection with FIGS. 7, 8, and/or 9. The diagram 1000 of the illustrated example of FIG. 10 shows processing performed at four nodes 1014, 1034, 1054, 1074. In the illustrated example of FIG. 10, each of the nodes 1014, 1034, 1054, 1074 receives training data having a balanced toggle rate. As a result, similar amounts of processing power are available at each of the nodes, further resulting in similar amounts of compute time being utilized among the nodes. While in the illustrated example of FIG. 10, the toggle rates are balanced among the nodes, the mini-batches used by each of the nodes are not identical and, as a result, might have slightly different toggle rates, resulting in slightly different amounts of processing time. However, the differences in processing time for a given iteration are minimized.

The example diagram 1000 of the illustrated example of FIG. 10 shows processing 1010, 1030, 1050, 1070 performed at a first node 1014, a second node 1034, a third node 1054, and a fourth node 1074, on mini-batches of training data having balanced toggle rates 1012, 1032, 1052, 1072, respectively. In the illustrated example of FIG. 10, because the toggle rates are balanced, the amounts of available CPU power 1016, 1036, 1056, 1076 are balanced as well, resulting in similar amounts of compute time 1018, 1038, 1058, 1078.

In the illustrated example of FIG. 10, line 1090 represents an amount of time taken for the training iteration. In the illustrated example of FIG. 10, line 1090 corresponds to the end of the training computations performed by the fourth node 1074. In contrast to the example diagram 600 of FIG. 6, differences in compute time(s) are minimized, resulting in a reduction in the amount of time where a node is idle before proceeding to the next training iteration.

To summarize FIG. 10, when toggle rates of mini-batches are balanced, the resultant compute times are likewise balanced, thereby reducing the amount of time that nodes that complete early must wait before proceeding to the next training iteration.

FIG. 11 is a flowchart representative of example machine readable instructions which may be executed to implement the example node 421 of FIGS. 4 and/or 5 to select a power state for use in a training iteration. The example process 920 of the illustrated example of FIG. 11 begins when the example power state controller 550 determines an average toggle rate of items in the selected mini-batch. (Block 1110). In examples disclosed herein, the example power state controller 550 determines the average toggle rate by adding toggle rates corresponding to each of the items in the selected mini-batch, and dividing the sum of those toggle rates by the number of items in the mini-batch.
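A one-line sketch of this computation, assuming the per-item toggle rates of the mini-batch are available as a list of floats (the function name is illustrative):

    def average_toggle_rate(mini_batch_toggle_rates):
        # Block 1110: sum of the per-item toggle rates divided by the item count.
        return sum(mini_batch_toggle_rates) / len(mini_batch_toggle_rates)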

The example power state controller 550 determines a number of available power states. (Block 1120). In examples disclosed herein, lower numbered power states represent higher wattage power states, and higher numbered power states represent lower wattage power states. For example, a power state 0 represents a higher wattage power state than a power state 1.

The example power state controller 550 then selects a power state based on the average toggle rate. (Block 1130). In examples disclosed herein, the example power state controller 550 selects the power state by determining a number of available power states, and utilizing Equation 4, below:

[N*2*(0.5−T)]  Equation 4

In Equation 4, N represents the number of available power states, and T represents the average toggle rate of the mini-batch. As noted above, toggle rates are represented by a number between zero and one half. The square brackets in Equation 4 represent a function for selecting a closest integer value. Thus, the power state selection is performed by selecting a power state corresponding to a nearest integer of the product of the number of available power states, two, and a difference between one half and the average toggle rate. As a result, when the toggle rate is high (e.g., approaching the maximum toggle rate (0.5)), a low numbered power state will be selected (corresponding to a higher wattage power state). Conversely, when the toggle rate is low (e.g., approaching the minimum toggle rate (0)), a high numbered power state will be selected (corresponding to a lower wattage power state). The example power state controller 550 then sets the power state of the node 421. (Block 1140). Such a power state selection approach results in mini-batches with low toggle rates being throttled to conserve power, in an attempt to have all of the nodes complete the training iteration at approximately a same time.
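The following sketch applies Equation 4 to pick a power state index. Rounding to the nearest integer mirrors the square brackets in Equation 4; the final clamp to the range 0 through N-1 is an added assumption so the sketch never returns an out-of-range state, and the function name is illustrative.

    def select_power_state(num_power_states, avg_toggle_rate):
        # Equation 4: nearest integer of N * 2 * (0.5 - T).
        state = round(num_power_states * 2 * (0.5 - avg_toggle_rate))
        # Assumption: clamp the result to the valid state indices 0 .. N-1.
        return max(0, min(num_power_states - 1, state))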

In the illustrated example of FIG. 11, the toggle rate corresponds to the toggle rate of the mini-batch identified at the particular node 421. However, in some examples, such an approach may be implemented to consider the mini-batch toggle rates of other nodes. Thus, for example, if all nodes happen to generate low-toggle-rate mini-batches, those nodes may operate at a high frequency, since no node is likely to “outrun” the others.

FIG. 12 is a diagram illustrating various amounts of compute time based on the use of balanced toggle rates and selected power states described in connection with FIGS. 7, 8, 9, and/or 11. The diagram 1200 of the illustrated example of FIG. 12 shows processing performed at four nodes 1214, 1234, 1254, 1274. In the illustrated example of FIG. 12, each of the nodes 1214, 1234, 1254, 1274 receives training data having a balanced toggle rate. As a result, similar amounts of processing power are available at each of the nodes, further resulting in similar amounts of compute time being utilized among the nodes. While in the illustrated example of FIG. 12, the toggle rates are balanced among the nodes, the mini-batches used by each of the nodes are not identical and, as a result, might have slightly different toggle rates, resulting in slightly different amounts of processing time. However, the differences in processing time for a given iteration are minimized as a result of the power state control.

The example diagram 1200 of the illustrated example of FIG. 12 shows processing 1210, 1230, 1250, 1270 performed at a first node 1214, a second node 1234, a third node 1254, and a fourth node 1274, on mini-batches of training data having balanced toggle rates 1212, 1232, 1252, 1272, respectively. As shown in the example diagram 1000 of the illustrated example of FIG. 10, when the toggle rates are balanced, the amounts of compute time are likewise balanced. However, as also noted in connection with FIG. 10, while the toggle rates are balanced, there can still be minor variations in the toggle rates among each of the nodes. As discussed in connection with FIG. 11, power consumption at each of the nodes can be controlled to throttle an amount of compute time required (while reducing power consumed at the node).

In the illustrated example of FIG. 12, power 1216 consumed by the first node 1214 is reduced to a first level based on the toggle rate 1212 of the mini-batch used by the first node 1214. Reducing the power consumed by the first node 1214 ensures that the first node will complete the training iteration at approximately the same time as the other nodes. Similar throttling is also applied at the second node 1234, the third node 1254, and the fourth node 1274.

In the illustrated example of FIG. 12, line 1290 represents an amount of time taken for the training iteration. In the illustrated example of FIG. 12, line 1290 corresponds to the end of the training computations performed by the fourth node 1274. In contrast to the example diagram 600 of FIG. 6, as well as the example diagram 1000 of FIG. 10, differences in compute time(s) are minimized, resulting in a reduction in the amount of time where a node is idle before proceeding to the next training iteration.

To summarize FIG. 12, while using balanced toggle rates is beneficial for balancing compute times, minor variations in the toggle rates can likewise cause minor variations in the compute times. By adjusting an amount of CPU power available, those differences in compute times can be further reduced, while also reducing an amount of power consumed by the nodes.

FIG. 13 is a block diagram of an example processor platform 1300 capable of executing the instructions of FIGS. 7 and/or 8 to implement the training controller 410 of FIG. 4. The processor platform 1300 can be, for example, a server, a personal computer, a mobile device (e.g., a cell phone, a smart phone, a tablet such as an iPad™), a personal digital assistant (PDA), an Internet appliance, a DVD player, a CD player, a digital video recorder, a Blu-ray player, a gaming console, a personal video recorder, a set top box, or any other type of computing device.

The processor platform 1300 of the illustrated example includes a processor 1312. The processor 1312 of the illustrated example is hardware. For example, the processor 1312 can be implemented by one or more integrated circuits, logic circuits, microprocessors or controllers from any desired family or manufacturer. The hardware processor may be a semiconductor based (e.g., silicon based) device. In this example, the processor 1312 implements the example toggle rate identifier 435, the example training data sorter 440, the example training data grouper 445, and/or the example node controller 450.

The processor 1312 of the illustrated example includes a local memory 1313 (e.g., a cache). The processor 1312 of the illustrated example is in communication with a main memory including a volatile memory 1314 and a non-volatile memory 1316 via a bus 1318. The volatile memory 1314 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS Dynamic Random Access Memory (RDRAM) and/or any other type of random access memory device. The non-volatile memory 1316 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 1314, 1316 is controlled by a memory controller.

The processor platform 1300 of the illustrated example also includes an interface circuit 1320. The interface circuit 1320 may be implemented by any type of interface standard, such as an Ethernet interface, an InfiniBand interface, an Omni-Path Architecture interface, a universal serial bus (USB), and/or a PCI express interface. The example interface circuit 1320 may implement the example node interface 470.

In the illustrated example, one or more input devices 1322 are connected to the interface circuit 1320. The input device(s) 1322 permit(s) a user to enter data and/or commands into the processor 1312. The input device(s) can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, a sensor, an isopoint, and/or a voice recognition system.

One or more output devices 1324 are also connected to the interface circuit 1320 of the illustrated example. The output devices 1324 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display, a cathode ray tube (CRT) display, a touchscreen, a tactile output device, a printer and/or speakers). The interface circuit 1320 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip and/or a graphics driver processor.

The interface circuit 1320 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem and/or network interface card to facilitate exchange of data with external machines (e.g., computing devices of any kind) via a network 1326 (e.g., an Ethernet connection, a digital subscriber line (DSL), a telephone line, coaxial cable, a cellular telephone system, etc.).

The processor platform 1300 of the illustrated example also includes one or more mass storage devices 1328 for storing software and/or data. Examples of such mass storage devices 1328 include floppy disk drives, hard drive disks, compact disk drives, Blu-ray disk drives, RAID systems, and digital versatile disk (DVD) drives.

The coded instructions 1332 of FIGS. 7 and/or 8 may be stored in the mass storage device 1328, in the volatile memory 1314, in the non-volatile memory 1316, and/or on a removable tangible computer readable storage medium such as a CD or DVD. The example mass storage device 1328 may implement the example training data store 430 and/or the example centralized training parameter data store 460.

FIG. 14 is a block diagram of an example processor platform 1400 capable of executing the instructions of FIGS. 7, 8, 9, and/or 11 to implement the example node 421 of FIGS. 4 and/or 5. The processor platform 1400 can be, for example, a server, a personal computer, a mobile device (e.g., a cell phone, a smart phone, a tablet such as an iPad™), a personal digital assistant (PDA), an Internet appliance, a DVD player, a CD player, a digital video recorder, a Blu-ray player, a gaming console, a personal video recorder, a set top box, or any other type of computing device.

The processor platform 1400 of the illustrated example includes a processor 1412. The processor 1412 of the illustrated example is hardware. For example, the processor 1412 can be implemented by one or more integrated circuits, logic circuits, microprocessors or controllers from any desired family or manufacturer. The hardware processor may be a semiconductor based (e.g., silicon based) device. In this example, the processor 1412 implements the example neural network trainer 520, the example neural network processor 530, and/or the example power state controller 550.

The processor 1412 of the illustrated example includes a local memory 1413 (e.g., a cache). The processor 1412 of the illustrated example is in communication with a main memory including a volatile memory 1414 and a non-volatile memory 1416 via a bus 1418. The volatile memory 1414 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS Dynamic Random Access Memory (RDRAM) and/or any other type of random access memory device. The non-volatile memory 1416 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 1414, 1416 is controlled by a memory controller.

The processor platform 1400 of the illustrated example also includes an interface circuit 1420. The interface circuit 1420 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB), and/or a PCI express interface.

In the illustrated example, one or more input devices 1422 are connected to the interface circuit 1420. The input device(s) 1422 permit(s) a user to enter data and/or commands into the processor 1412. The input device(s) can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, an isopoint and/or a voice recognition system.

One or more output devices 1424 are also connected to the interface circuit 1420 of the illustrated example. The output devices 1424 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display, a cathode ray tube (CRT) display, a touchscreen, a tactile output device, a printer and/or speakers). The interface circuit 1420 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip and/or a graphics driver processor.

The interface circuit 1420 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem and/or network interface card to facilitate exchange of data with external machines (e.g., computing devices of any kind) via a network 1426 (e.g., an Ethernet connection, a digital subscriber line (DSL), a telephone line, coaxial cable, a cellular telephone system, etc.). The example interface circuit 1420 implements the example training controller interface 510.

The processor platform 1400 of the illustrated example also includes one or more mass storage devices 1428 for storing software and/or data. Examples of such mass storage devices 1428 include floppy disk drives, hard drive disks, compact disk drives, Blu-ray disk drives, RAID systems, and digital versatile disk (DVD) drives.

The coded instructions 1432 of FIGS. 9 and/or 11 may be stored in the mass storage device 1428, in the volatile memory 1414, in the non-volatile memory 1416, and/or on a removable tangible computer readable storage medium such as a CD or DVD. The example mass storage device 1428 implements the example neural network parameter memory 540.

Example 1 includes an apparatus for distributed training of neural networks, the apparatus comprising a neural network trainer to select a plurality of training data items from a training data set based on a toggle rate of each item in the training data set, a neural network parameter memory to store neural network training parameters, and a neural network processor to implement a neural network to generate training data results from distributed training over a plurality of nodes of the neural network using the selected training data items and the neural network training parameters, the neural network trainer to synchronize the training data results and to update the neural network training parameters based on the synchronized training data results.

Example 2 includes the apparatus of example 1, further including a power state controller to control an amount of power consumption of a central processing unit of the apparatus based on an average toggle rate of the training data items in the selected training data items.

Example 3 includes the apparatus of example 1, further including a toggle rate identifier to determine a toggle rate for each item of training data in the training data set, a training data sorter to sort the items in the training data by their corresponding toggle rate, and a training data grouper to allocate a first number of items of the sorted items to a first group, and to allocate a second number of items of the sorted items to a second group, the second number of items being sequentially located after the first number of items in the sorted training data, the selection of the plurality of training data items being performed among the first group and the second group.

Example 4 includes the apparatus of example 3, wherein the training data grouper is further to shuffle the items allocated to the first group within the first group.

Example 5 includes the apparatus of any one of examples 1-4, wherein the toggle rate associated with a training data item represents an amount of data variance within the training data item.

Example 6 includes the apparatus of example 1, wherein the neural network is implemented as a deep neural network.

Example 7 includes the apparatus of example 1, wherein the training data items include image data.

Example 8 includes a non-transitory computer readable medium comprising instructions which, when executed, cause a machine to at least select a plurality of training items based on a toggle rate of training data items in a training data set, perform neural network training using the selected plurality of training data items and stored training parameters to determine training results, synchronize the training results with other nodes involved in distributed training, and update stored training parameters based on the synchronized training results.

Example 9 includes the non-transitory computer readable medium of example 8, wherein the instructions, when executed, cause the machine to at least determine a toggle rate for each item of training data in the training data set, sort the items in the training data by their corresponding toggle rate, allocate a first number of items of the sorted items to a first group, and allocate a second number of items of the sorted items to a second group, the second number of items being sequentially located after the first number of items in the sorted training data.

Example 10 includes the non-transitory computer readable medium of example 9, further including shuffling the items allocated to the first group within the first group.

Example 11 includes the non-transitory computer readable medium of any one of examples 9-10, wherein the toggle rate associated with a training data item represents an amount of data variance within the training item.

Example 12 includes the non-transitory computer readable medium of example 8, wherein the instructions, when executed, cause the machine to at least determine an average toggle rate of the items in the plurality of training items, select a power state based on the average toggle rate, and apply the selected power state to the machine.

Example 13 includes the non-transitory computer readable medium of example 8, wherein the training data items include image data.

Example 14 includes a method for distributed training of neural networks, the method comprising selecting, by executing an instruction with a processor of a node, a plurality of training items based on a toggle rate of training data items in a training data set, performing, by executing an instruction with the processor of the node, neural network training using the selected plurality of training data items and stored training parameters to determine training results, synchronizing the training results with other nodes involved in distributed training, and updating stored training parameters based on the synchronized training results.

Example 15 includes the method of example 14, wherein the selecting of the plurality of training items includes determining a toggle rate for each item of training data in the training data set, sorting the items in the training data by their corresponding toggle rate, allocating a first number of items of the sorted items to a first group, and allocating a second number of items of the sorted items to a second group, the second number of items being sequentially located after the first number of items in the sorted training data.

Example 16 includes the method of example 15, further including shuffling the items allocated to the first group within the first group.

Example 17 includes the method of any one of examples 14-16, wherein the toggle rate associated with a training data item represents an amount of data variance within the training item.

Example 18 includes the method of example 14, further including determining, by executing an instruction with the processor of the node, an average toggle rate of the items in the plurality of training items, selecting, by executing an instruction with the processor of the node, a power state based on the average toggle rate, and applying, by executing an instruction with the processor of the node, the selected power state to the node.

Example 19 includes an apparatus for distributed training of neural networks, the apparatus comprising means for selecting, by executing an instruction with a processor of a node, a plurality of training items based on a toggle rate of training data items in a training data set, means for performing, by executing an instruction with the processor of the node, neural network training using the selected plurality of training data items and stored training parameters to determine training results, means for synchronizing the training results with other nodes involved in distributed training, and means for updating stored training parameters based on the synchronized training results.

Example 20 includes the apparatus of example 19, further including means for determining a toggle rate for each item of training data in the training data set, means for sorting the items in the training data by their corresponding toggle rate, means for allocating a first number of items of the sorted items to a first group, and means for allocating a second number of items of the sorted items to a second group, the second number of items being sequentially located after the first number of items in the sorted training data.

Example 21 includes the apparatus of example 20, further including means for shuffling the items allocated to the first group within the first group.

Example 22 includes the apparatus of any one of examples 19-21, wherein the toggle rate associated with a training data item represents an amount of data variance within the training item.

Example 23 includes an apparatus for distributed training of neural networks, the apparatus comprising a neural network trainer to select a plurality of training data items, a power state controller to control an amount of power consumption of a central processing unit of the apparatus based on an average toggle rate of the training data items in the selected plurality of training data items, a neural network parameter memory to store neural network training parameters, and a neural network processor to implement a neural network to generate training data results using the selected plurality of training data items and the neural network training parameters, the neural network trainer to synchronize the training results with other nodes involved in distributed training and update the neural network training parameters stored in the neural network parameter memory based on the synchronized training results.

Example 24 includes the apparatus of example 23, wherein the power state controller is to control the amount of power consumption of the central processing unit by setting a power state of the central processing unit.

Example 25 includes the apparatus of any one of examples 23-24, wherein the toggle rate represents an amount of data variance within a training data item.

Example 26 includes the apparatus of example 23, wherein the neural network is implemented as a deep neural network.

Example 27 includes the apparatus of example 26, wherein the deep neural network includes at least one convolutional layer.

Example 28 includes the apparatus of example 23, wherein the training data items include image data.

Example 29 includes a non-transitory computer readable medium comprising instructions which, when executed, cause a machine to at least select a plurality of training data items of training items, determine an average toggle rate of the training items in the plurality of training data items, select a power state based on the average toggle rate, apply the selected power state to a central processing unit of the machine, perform neural network training using the selected plurality of training data items and stored training parameters to determine training results, synchronize the training results with other machines involved in distributed training, and update stored training parameters based on the synchronized training results.

Example 30 includes the non-transitory computer readable medium of example 29, wherein the toggle rate represents an amount of data variance within a training item.

Example 31 includes the non-transitory computer readable medium of example 29, wherein the training items include image data.

Example 32 includes a method for distributed training of neural networks, the method comprising selecting, by executing an instruction with a processor of a node, a plurality of training data items of training items, determining, by executing an instruction with the processor of the node, an average toggle rate of the items in the plurality of training data items, selecting, by executing an instruction with the processor of the node, a power state based on the average toggle rate, applying, by executing an instruction with the processor of the node, the selected power state to the node, performing, by executing an instruction with the processor of the node, neural network training using the selected plurality of training data items and stored training parameters to determine training results, synchronizing the training results with other nodes involved in the distributed training, and updating stored training parameters based on the synchronized training results.

Example 33 includes the method of example 32, wherein the toggle rate represents an amount of data variance within a training item.

Example 34 includes an apparatus for distributed training of neural networks, the apparatus comprising means for selecting, at a node, a plurality of training data items of training items, means for determining an average toggle rate of the items in the plurality of training data items, means for selecting a power state based on the average toggle rate, means for applying the selected power state to the node, means for performing neural network training using the selected plurality of training data items and stored training parameters to determine training results, means for synchronizing the training results with other nodes involved in the distributed training, and means for updating stored training parameters based on the synchronized training results.

Example 35 includes the apparatus of example 34, wherein the toggle rate represents an amount of data variance within a training item.

Although certain example methods, apparatus and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all methods, apparatus, and articles of manufacture fairly falling within the scope of the claims of this patent.

What is claimed is:
1. An apparatus for distributed training of neural networks, the apparatus comprising: a neural network trainer to select a plurality of training data items from a training data set based on a toggle rate of each item in the training data set; a neural network parameter memory to store neural network training parameters; and a neural network processor to implement a neural network to generate training data results from distributed training over a plurality of nodes of the neural network using the selected training data items and the neural network training parameters, the neural network trainer to synchronize the training data results and to update the neural network training parameters based on the synchronized training data results.

2. The apparatus of claim 1, further including a power state controller to control an amount of power consumption of a central processing unit of the apparatus based on an average toggle rate of the training data items in the selected training data items.

3. The apparatus of claim 1, further including: a toggle rate identifier to determine a toggle rate for each item of training data in the training data set; a training data sorter to sort the items in the training data by their corresponding toggle rate; and a training data grouper to allocate a first number of items of the sorted items to a first group, and to allocate a second number of items of the sorted items to a second group, the second number of items being sequentially located after the first number of items in the sorted training data, the selection of the plurality of training data items being performed among the first group and the second group.

4. The apparatus of claim 3, wherein the training data grouper is further to shuffle the items allocated to the first group within the first group.

5. The apparatus of claim 1, wherein the toggle rate associated with a training data item represents an amount of data variance within the training data item.

6. The apparatus of claim 1, wherein the neural network is implemented as a deep neural network.

7. The apparatus of claim 1, wherein the training data items include image data.

8. A non-transitory computer readable medium comprising instructions which, when executed, cause a machine to at least: select a plurality of training items based on a toggle rate of training data items in a training data set; perform neural network training using the selected plurality of training data items and stored training parameters to determine training results; synchronize the training results with other nodes involved in distributed training; and update stored training parameters based on the synchronized training results.

9. The non-transitory computer readable medium of claim 8, wherein the instructions, when executed, cause the machine to at least: determine a toggle rate for each item of training data in the training data set; sort the items in the training data by their corresponding toggle rate; allocate a first number of items of the sorted items to a first group; and allocate a second number of items of the sorted items to a second group, the second number of items being sequentially located after the first number of items in the sorted training data.

10. The non-transitory computer readable medium of claim 9, further including shuffling the items allocated to the first group within the first group.

11. The non-transitory computer readable medium of claim 9, wherein the toggle rate associated with a training data item represents an amount of data variance within the training item.

12. The non-transitory computer readable medium of claim 8, wherein the instructions, when executed, cause the machine to at least: determine an average toggle rate of the items in the plurality of training items; select a power state based on the average toggle rate; and apply the selected power state to the machine.

13. The non-transitory computer readable medium of claim 8, wherein the training data items include image data.

14. A method for distributed training of neural networks, the method comprising: selecting, by executing an instruction with a processor of a node, a plurality of training items based on a toggle rate of training data items in a training data set; performing, by executing an instruction with the processor of the node, neural network training using the selected plurality of training data items and stored training parameters to determine training results; synchronizing the training results with other nodes involved in distributed training; and updating stored training parameters based on the synchronized training results.

15. The method of claim 14, wherein the selecting of the plurality of training items includes: determining a toggle rate for each item of training data in the training data set; sorting the items in the training data by their corresponding toggle rate; allocating a first number of items of the sorted items to a first group; and allocating a second number of items of the sorted items to a second group, the second number of items being sequentially located after the first number of items in the sorted training data.

16. The method of claim 15, further including shuffling the items allocated to the first group within the first group.

17. The method of claim 14, wherein the toggle rate associated with a training data item represents an amount of data variance within the training item.

18. The method of claim 14, further including: determining, by executing an instruction with the processor of the node, an average toggle rate of the items in the plurality of training items; selecting, by executing an instruction with the processor of the node, a power state based on the average toggle rate; and applying, by executing an instruction with the processor of the node, the selected power state to the node.